Hi! Yesterday I explained how to store data in MongoDB through an Item Pipeline. Today is a hands-on post: let's crawl the articles on TraNews (全球新聞網)!
The weather turned cold and I caught a cold, aching all over with a sore throat... everyone, please keep warm (Q_Q)
First, create a new Scrapy project:

scrapy startproject traNews

Then generate a Spider named news:

scrapy genspider news example.com

The project structure now looks like this:
.
├── scrapy.cfg
└── traNews
├── __init__.py
├── items.py
├── middlewares.py
├── pipelines.py
├── settings.py
└── spiders
├── __init__.py
└── news.py
2 directories, 8 files
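For reference, genspider produces a bare skeleton in spiders/news.py roughly like the following (the exact template varies a little between Scrapy versions); we will edit it next:

import scrapy


class NewsSpider(scrapy.Spider):
    name = 'news'
    allowed_domains = ['example.com']
    start_urls = ['http://example.com/']

    def parse(self, response):
        pass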
Open news.py in the spiders folder and define allowed_domains, start_urls, and a num variable used to track the current page number:

import scrapy
from bs4 import BeautifulSoup
class NewsSpider(scrapy.Spider):
    name = "news"
    num = 1
    start_urls = ['http://blog.tranews.com/blog/category/%E6%97%85%E9%81%8A',
                  'http://blog.tranews.com/blog/%E7%BE%8E%E9%A3%9F',
                  'http://blog.tranews.com/blog/%E8%97%9D%E6%96%87',
                  'http://blog.tranews.com/blog/%E4%BC%91%E9%96%92']

    def parse(self, response):
        pass
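A small note: if you kept the allowed_domains = ['example.com'] line that genspider generated, Scrapy's offsite filtering would drop requests to the real site. A minimal fix, assuming blog.tranews.com is the domain we actually crawl, is to update (or remove) that attribute:

    allowed_domains = ['blog.tranews.com']  # assumption: replace the generated 'example.com' placeholder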
Each article's href sits inside an h2 element with class="entry-title":

def parse(self, response):
    soup = BeautifulSoup(response.text, 'lxml')
    titles = soup.select('h2.entry-title')
    for t in titles:
        link = t.select_one('a').get('href')
        title = t.text
        yield scrapy.Request(link, callback=self.article_parser)
t.text gets the article's title, t.select_one('a') reaches the a tag one level down, and calling .get('href') on it returns the article's link. Here we can use meta to store these two values so that the article_parser function can use them.
So we rewrite it like this (optional):
def parse(self, response):
    soup = BeautifulSoup(response.text, 'lxml')
    titles = soup.select('h2.entry-title')
    for t in titles:
        meta = {
            'title': t.text,
            'link': t.select_one('a').get('href')
        }
        # link = t.select_one('a').get('href')
        # title = t.text
        yield scrapy.Request(meta['link'], callback=self.article_parser, meta=meta)
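Inside the callback, those values come back on response.meta. A minimal illustration of how article_parser (written out in full below) can read them:

def article_parser(self, response):
    title = response.meta['title']  # stored in parse() via the meta= argument
    link = response.meta['link']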
This grabs all the titles and links on the current page, but not every article, so we now need pagination. Watching the Network tab while scrolling down, you can see that the page number at the end of the URL changes, as shown in the figure:

This is where the num variable comes in. The idea is to increment num to move to the next page, and each time we switch pages, crawl all of that page's links again:

def parse(self, response):
    soup = BeautifulSoup(response.text, 'lxml')
    titles = soup.select('h2.entry-title')
    for t in titles:
        meta = {
            'title': t.text,
            'link': t.select_one('a').get('href')
        }
        yield scrapy.Request(meta['link'], callback=self.article_parser, meta=meta)
    self.num += 1
    next_page = self.start_urls[0] + '/page/' + str(self.num)
    yield scrapy.Request(next_page, callback=self.parse)
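One caveat: as written, num keeps increasing forever (and only start_urls[0] is paginated), so the spider never stops on its own. A minimal sketch of one possible stopping condition, which simply stops paginating once a page returns no more articles:

    def parse(self, response):
        soup = BeautifulSoup(response.text, 'lxml')
        titles = soup.select('h2.entry-title')
        for t in titles:
            meta = {'title': t.text, 'link': t.select_one('a').get('href')}
            yield scrapy.Request(meta['link'], callback=self.article_parser, meta=meta)
        if titles:  # keep paginating only while the current page still has articles
            self.num += 1
            next_page = self.start_urls[0] + '/page/' + str(self.num)
            yield scrapy.Request(next_page, callback=self.parse)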
Next, the Item. We're not yet sure exactly which fields we'll need, so define just these for now (more can be added later). In items.py:

class TranewsItem(scrapy.Item):
    # define the fields for your item here like:
    title = scrapy.Field()
    link = scrapy.Field()
    content = scrapy.Field()
    time = scrapy.Field()
    img = scrapy.Field()
To use the Item, add the import at the top of news.py:

from ..items import TranewsItem
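So the top of news.py now looks roughly like this (the bs4 import is included because the parse code above uses BeautifulSoup):

import scrapy
from bs4 import BeautifulSoup
from ..items import TranewsItem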
The article body is made up of many <p> tags, so we need to pull out the text of each p element and join the pieces together. Since I haven't introduced Scrapy's own parsing methods yet, I'll break it apart with BeautifulSoup to make it easier to follow:

def article_parser(self, response):
    soup = BeautifulSoup(response.text, 'lxml')
    article = TranewsItem()
    article['title'] = response.meta['title']
    article['link'] = response.meta['link']
    contents = soup.select('div.entry-content p')
    article['content'] = ''
    for content in contents:
        article['content'] = article['content'] + content.text
    # Roughly equivalent one-liner using Scrapy's own selectors instead of the loop above:
    # content = response.css('div.entry-content p::text').extract()[1:]
    # article['content'] = ','.join(content)
    article['img'] = soup.select_one('img').get('src')
    article['time'] = soup.select_one('span.entry-date').text
The whole article_parser function therefore becomes:
def article_parser(self, response):
    soup = BeautifulSoup(response.text, 'lxml')
    article = TranewsItem()
    article['title'] = response.meta['title']
    article['link'] = response.meta['link']
    contents = soup.select('div.entry-content p')
    article['content'] = ''
    for content in contents:
        article['content'] = article['content'] + content.text
    article['img'] = soup.select_one('img').get('src')
    article['time'] = soup.select_one('span.entry-date').text
    return article
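For comparison, here is a minimal sketch of the same extraction using Scrapy's built-in CSS selectors instead of BeautifulSoup (same class-name assumptions; on older Scrapy versions use .extract() / .extract_first() instead of .getall() / .get()):

def article_parser(self, response):
    article = TranewsItem()
    article['title'] = response.meta['title']
    article['link'] = response.meta['link']
    # join the text of every <p> inside the article body
    article['content'] = ''.join(response.css('div.entry-content p::text').getall())
    article['img'] = response.css('img::attr(src)').get()
    article['time'] = response.css('span.entry-date::text').get()
    return article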
OK, that's it for today's walkthrough of the Spider. Tomorrow I'll continue with how to store the results in a MySQL database. See you tomorrow!
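To try the spider out, run it from the project root and export the scraped items to a file for a quick check, for example:

scrapy crawl news -o articles.json

(The output file is just for inspection; tomorrow the items will go into MySQL through a pipeline instead.)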